Predictive Policing - Technical Implementation

MUSA 5080 - Fall 2025

Author

Jinheng

Published

November 3, 2025

About This Exercise

In this exercise, you will build a spatial predictive model for burglaries using count regression and spatial features.

Learning Objectives

By the end of this exercise, you will be able to:

  1. Create a fishnet grid for aggregating point-level crime data
  2. Calculate spatial features including k-nearest neighbors and distance measures
  3. Diagnose spatial autocorrelation using Local Moran’s I
  4. Fit and interpret Poisson and Negative Binomial regression for count data
  5. Implement spatial cross-validation (Leave-One-Group-Out)
  6. Compare model predictions to a Kernel Density Estimation baseline
  7. Evaluate model performance using appropriate metrics

Setup

Code
# Load required packages
library(tidyverse)      # Data manipulation
library(sf)             # Spatial operations
library(here)           # Relative file paths
library(viridis)        # Color scales
library(terra)          # Raster operations (replaces 'raster')
library(spdep)          # Spatial dependence
library(FNN)            # Fast nearest neighbors
library(MASS)           # Negative binomial regression
library(patchwork)      # Plot composition (replaces grid/gridExtra)
library(knitr)          # Tables
library(kableExtra)     # Table formatting
library(classInt)       # Classification intervals

# Spatstat split into sub-packages
library(spatstat.geom)    # Spatial geometries
library(spatstat.explore) # Spatial exploration/KDE

# Set options
options(scipen = 999)  # No scientific notation
set.seed(5080)         # Reproducibility

# Create consistent theme for visualizations
theme_crime <- function(base_size = 11) {
  theme_minimal(base_size = base_size) +
    theme(
      plot.title = element_text(face = "bold", size = base_size + 1),
      plot.subtitle = element_text(color = "gray30", size = base_size - 1),
      legend.position = "right",
      panel.grid.minor = element_blank(),
      axis.text = element_blank(),
      axis.title = element_blank()
    )
}

# Set as default
theme_set(theme_crime())

cat("✓ All packages loaded successfully!\n")
✓ All packages loaded successfully!
Code
cat("✓ Working directory:", getwd(), "\n")
✓ Working directory: /Users/jinhengcen/Documents/25fall/PPA/portfolio-setup-CenJinHeng/assignments/assignment_4 

Part 1: Load and Explore Data

Exercise 1.1: Load Chicago Spatial Data

Code
# Load police districts (used for spatial cross-validation)
policeDistricts <- 
  st_read("https://data.cityofchicago.org/api/geospatial/24zt-jpfn?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(District = dist_num)
Reading layer `OGRGeoJSON' from data source 
  `https://data.cityofchicago.org/api/geospatial/24zt-jpfn?method=export&format=GeoJSON' 
  using driver `GeoJSON'
Simple feature collection with 25 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -87.94011 ymin: 41.64455 xmax: -87.52414 ymax: 42.02303
Geodetic CRS:  WGS 84
Code
# Load police beats (smaller administrative units)
policeBeats <- 
  st_read("https://data.cityofchicago.org/api/geospatial/n9it-hstw?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%
  dplyr::select(Beat = beat_num)
Reading layer `OGRGeoJSON' from data source 
  `https://data.cityofchicago.org/api/geospatial/n9it-hstw?method=export&format=GeoJSON' 
  using driver `GeoJSON'
Simple feature collection with 277 features and 4 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -87.94011 ymin: 41.64455 xmax: -87.52414 ymax: 42.02303
Geodetic CRS:  WGS 84
Code
# Load Chicago boundary
chicagoBoundary <- 
  st_read("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/Chapter5/chicagoBoundary.geojson") %>%
  st_transform('ESRI:102271')
Reading layer `chicagoBoundary' from data source 
  `https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA/Chapter5/chicagoBoundary.geojson' 
  using driver `GeoJSON'
Simple feature collection with 1 feature and 1 field
Geometry type: POLYGON
Dimension:     XY
Bounding box:  xmin: -87.8367 ymin: 41.64454 xmax: -87.52414 ymax: 42.02304
Geodetic CRS:  WGS 84
Code
cat("✓ Loaded spatial boundaries\n")
✓ Loaded spatial boundaries
Code
cat("  - Police districts:", nrow(policeDistricts), "\n")
  - Police districts: 25 
Code
cat("  - Police beats:", nrow(policeBeats), "\n")
  - Police beats: 277 
Note: Coordinate Reference System

We’re using ESRI:102271 (NAD83(HARN) / Illinois State Plane East), a projected CRS measured in meters. This is appropriate for Chicago because:

  • It minimizes distortion in this region
  • Its planar meter units match the 500 m grid cells and 1 km KDE bandwidth used in this analysis
  • It allows accurate distance calculations

Exercise 1.2: Load Burglary Data

Code
# Load from provided data file (downloaded from Chicago open data portal)
burglaries <- st_read("data/burglaries.shp") %>% 
  st_transform('ESRI:102271')
Reading layer `burglaries' from data source 
  `/Users/jinhengcen/Documents/25fall/PPA/portfolio-setup-CenJinHeng/assignments/assignment_4/data/burglaries.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 7482 features and 22 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 340492 ymin: 552959.6 xmax: 367153.5 ymax: 594815.1
Projected CRS: NAD83(HARN) / Illinois East
Code
# Check the data
cat("\n✓ Loaded burglary data\n")

✓ Loaded burglary data
Code
cat("  - Number of burglaries:", nrow(burglaries), "\n")
  - Number of burglaries: 7482 
Code
cat("  - CRS:", st_crs(burglaries)$input, "\n")
  - CRS: ESRI:102271 
Code
# Guard against a missing/renamed date column (shapefile field names are
# truncated to 10 characters), rather than printing "Inf to -Inf":
if ("date" %in% names(burglaries)) {
  cat("  - Date range:", format(min(as.Date(burglaries$date), na.rm = TRUE)), "to",
      format(max(as.Date(burglaries$date), na.rm = TRUE)), "\n")
} else {
  cat("  - Date column not found; check names(burglaries)\n")
}
  - Date column not found; check names(burglaries)

Question 1.1: How many burglaries are in the dataset? What time period does this cover? Why does the coordinate reference system matter for our spatial analysis?

  • There are 7,482 burglaries in the dataset (the feature count printed above).

  • The data cover Chicago in 2017.

  • The CRS matters because the Earth is curved while we analyze space on a flat x-y plane; a projection defines that flattening. Different projections place the same point at different planar coordinates, so every layer must share one projection for features to align, and the projection should suit the study region so distances and areas are computed accurately.
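To make the projection point concrete, here is a minimal base-R sketch (illustrative only; the two coordinates are invented points roughly spanning part of Chicago): treating longitude/latitude degrees as planar x-y yields a meaningless "distance", while a great-circle (haversine) calculation returns meters. A projected CRS such as ESRI:102271 lets st_distance do planar math in real units directly.

```r
# Hedged sketch: why raw lon/lat degrees are unsuitable for distance math.
haversine <- function(lon1, lat1, lon2, lat2, R = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * R * asin(sqrt(a))  # great-circle distance in meters
}

# Euclidean distance in degree units has no physical meaning:
naive_deg <- sqrt((-87.6 - -87.7)^2 + (42.0 - 41.7)^2)

# The same pair of points is roughly 34 km apart on the ground:
real_m <- haversine(-87.7, 41.7, -87.6, 42.0)
```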

Warning: Critical Pause #1: Data Provenance

Before proceeding, consider where this data came from:

Who recorded this data? Chicago Police Department officers and detectives

What might be missing?

  • Unreported burglaries (victims didn’t call police)
  • Incidents police chose not to record
  • Downgraded offenses (burglary recorded as trespassing)
  • Spatial bias (more patrol = more recorded crime)

Think about: Was there a Department of Justice investigation of CPD during this period? What did they find about data practices?

Exercise 1.3: Visualize Point Data

Code
# Simple point map
p1 <- ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
  geom_sf(data = burglaries, color = "#d62828", size = 0.1, alpha = 0.4) +
  labs(
    title = "Burglary Locations",
    subtitle = paste0("Chicago 2017, n = ", nrow(burglaries))
  )

# Density surface using modern syntax
p2 <- ggplot() + 
  geom_sf(data = chicagoBoundary, fill = "gray95", color = "gray60") +
  geom_density_2d_filled(
    data = data.frame(st_coordinates(burglaries)),
    aes(X, Y),
    alpha = 0.7,
    bins = 8
  ) +
  scale_fill_viridis_d(
    option = "plasma",
    direction = -1,
    guide = "none"  # Modern ggplot2 syntax (not guide = FALSE)
  ) +
  labs(
    title = "Density Surface",
    subtitle = "Kernel density estimation"
  )

# Combine plots using patchwork (modern approach)
p1 + p2 + 
  plot_annotation(
    title = "Spatial Distribution of Burglaries in Chicago",
    tag_levels = 'A'
  )

Question 1.2: What spatial patterns do you observe? Are burglaries evenly distributed across Chicago? Where are the highest concentrations? What might explain these patterns?

  • Burglaries are not evenly distributed; the density map shows two main clusters, with the highest concentrations around Lincoln Square on the North Side and on the South Side (place names identified via Google Maps, as I am not familiar with Chicago). One possible explanation, following the "broken windows" theory, is that lower-income neighborhoods with poorly maintained housing experience more burglaries, though differences in reporting and patrol could also shape what the data record.

Part 2: Create Fishnet Grid

Exercise 2.1: Understanding the Fishnet

A fishnet grid converts irregular point data into a regular grid of cells where we can:

  • Aggregate counts
  • Calculate spatial features
  • Apply regression models

Think of it as overlaying graph paper on a map.

Code
# Create 500m x 500m grid
fishnet <- st_make_grid(
  chicagoBoundary,
  cellsize = 500,  # 500 meters per cell
  square = TRUE
) %>%
  st_sf() %>%
  mutate(uniqueID = row_number())

# Keep only cells that intersect Chicago
fishnet <- fishnet[chicagoBoundary, ]

# View basic info
cat("✓ Created fishnet grid\n")
✓ Created fishnet grid
Code
cat("  - Number of cells:", nrow(fishnet), "\n")
  - Number of cells: 2458 
Code
cat("  - Cell size:", 500, "x", 500, "meters\n")
  - Cell size: 500 x 500 meters
Code
cat("  - Cell area:", round(st_area(fishnet[1,])), "square meters\n")
  - Cell area: 250000 square meters

Question 2.1: Why do we use a regular grid instead of existing boundaries like neighborhoods or census tracts? What are the advantages and disadvantages of this approach?

  • The fishnet converts point data into uniform areal units whose counts we can model.
  • Because every cell has the same size and shape, irregular administrative boundaries introduce no bias, and the approach is standard in policy analysis; it is also easy to create the grid and aggregate point data to it. However, a regular grid can split natural neighborhoods and land-use areas.

Exercise 2.2: Aggregate Burglaries to Grid

Code
# Spatial join: which cell contains each burglary?
burglaries_fishnet <- st_join(burglaries, fishnet, join = st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(countBurglaries = n())

# Join back to fishnet (cells with 0 burglaries will be NA)
fishnet <- fishnet %>%
  left_join(burglaries_fishnet, by = "uniqueID") %>%
  mutate(countBurglaries = replace_na(countBurglaries, 0))

# Summary statistics
cat("\nBurglary count distribution:\n")

Burglary count distribution:
Code
summary(fishnet$countBurglaries)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.000   2.000   3.042   5.000  40.000 
Code
cat("\nCells with zero burglaries:", 
    sum(fishnet$countBurglaries == 0), 
    "/", nrow(fishnet),
    "(", round(100 * sum(fishnet$countBurglaries == 0) / nrow(fishnet), 1), "%)\n")

Cells with zero burglaries: 781 / 2458 ( 31.8 %)
Code
# Visualize aggregated counts
ggplot() +
  geom_sf(data = fishnet, aes(fill = countBurglaries), color = NA) +
  geom_sf(data = chicagoBoundary, fill = NA, color = "white", linewidth = 1) +
  scale_fill_viridis_c(
    name = "Burglaries",
    option = "plasma",
    trans = "sqrt",  # Square root for better visualization of skewed data
    breaks = c(0, 1, 5, 10, 20, 40)
  ) +
  labs(
    title = "Burglary Counts by Grid Cell",
    subtitle = "500m x 500m cells, Chicago 2017"
  ) +
  theme_crime()

Question 2.2: What is the distribution of burglary counts across cells? Why do so many cells have zero burglaries? Is this distribution suitable for count regression? (Hint: look up overdispersion)

  • High-count cells cluster near one another, reflecting Tobler's first law of geography.

  • Reasons so many cells are zero:

    • They cover parks, rivers, and other natural or non-residential areas where few burglaries occur

    • Some incidents may be missing due to errors in data collection, cleaning, or transformation

  • Poisson regression assumes variance = mean, but here the variance exceeds the mean (overdispersion): many cells are zero while a few have very high counts, so the distribution is not well suited to a plain Poisson model.
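The variance > mean claim is easy to verify directly; with the assignment data you would compare var(fishnet$countBurglaries) to mean(fishnet$countBurglaries). A hedged, self-contained sketch with simulated Negative Binomial counts (overdispersed by construction; all values invented):

```r
# Overdispersion check on simulated counts (a stand-in for
# fishnet$countBurglaries; the data here are made up).
set.seed(5080)
counts <- rnbinom(2000, size = 2, mu = 3)

m <- mean(counts)
v <- var(counts)

# For Poisson data this ratio would be near 1; the NB variance is
# mu + mu^2/size = 7.5 here, so the ratio lands around 2.5.
dispersion_ratio <- v / m
```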

Part 3: Create a Kernel Density Baseline

Before building complex models, let’s create a simple baseline using Kernel Density Estimation (KDE).

The KDE baseline asks: “What if crime just happens where it happened before?” (simple spatial smoothing, no predictors)

Code
# Convert burglaries to ppp (point pattern) format for spatstat
burglaries_ppp <- as.ppp(
  st_coordinates(burglaries),
  W = as.owin(st_bbox(chicagoBoundary))
)

# Calculate KDE with 1km bandwidth
kde_burglaries <- density.ppp(
  burglaries_ppp,
  sigma = 1000,  # 1km bandwidth
  edge = TRUE    # Edge correction
)

# Convert to terra raster (modern approach, not raster::raster)
kde_raster <- rast(kde_burglaries)

# Extract KDE values to fishnet cells
fishnet <- fishnet %>%
  mutate(
    kde_value = terra::extract(
      kde_raster,
      vect(fishnet),
      fun = mean,
      na.rm = TRUE
    )[, 2]  # Extract just the values column
  )

cat("✓ Calculated KDE baseline\n")
✓ Calculated KDE baseline
Code
ggplot() +
  geom_sf(data = fishnet, aes(fill = kde_value), color = NA) +
  geom_sf(data = chicagoBoundary, fill = NA, color = "white", linewidth = 1) +
  scale_fill_viridis_c(
    name = "KDE Value",
    option = "plasma"
  ) +
  labs(
    title = "Kernel Density Estimation Baseline",
    subtitle = "Simple spatial smoothing of burglary locations"
  ) +
  theme_crime()

Question 3.1: How does the KDE map compare to the count map? What does KDE capture well? What does it miss?

  • It captures broad regional patterns across many cells and smooths out local extremes (including zero-count cells).

  • It loses local detail: some cells with zero recorded burglaries show high KDE values purely because of nearby high-count cells.

Tip: Why Start with KDE?

The KDE represents our null hypothesis: burglaries happen where they happened before, with no other information.

Your complex model must outperform this simple baseline to justify its complexity.

We’ll compare back to this at the end!

Part 4: Create Spatial Predictor Variables

Now we’ll create features that might help predict burglaries. We’ll use “broken windows theory” logic: signs of disorder predict crime.

Exercise 4.1: Load 311 Abandoned Building Calls

Code
abandoned_building <- read_csv("data/311_calls.csv")%>%
  filter(!is.na(LATITUDE), !is.na(LONGITUDE)) %>%
  st_as_sf(coords = c("LONGITUDE", "LATITUDE"), crs = 4326) %>%
  st_transform('ESRI:102271')

cat("✓ Loaded abandoned building calls\n")
✓ Loaded abandoned building calls
Code
cat("  - Number of calls:", nrow(abandoned_building), "\n")
  - Number of calls: 65073 
Note: Data Loading

The data were downloaded from Chicago’s Open Data Portal. You can also request an API token from the portal and pull the data directly through its API.

Consider: How might the 311 reporting system itself be biased? Who calls 311? Which neighborhoods have better 311 awareness?

Exercise 4.2: Count of Abandoned Buildings per Cell

Code
# Aggregate abandoned building calls to fishnet
abandoned_fishnet <- st_join(abandoned_building, fishnet, join = st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID) %>%
  summarize(abandoned_building = n())

# Join to fishnet
fishnet <- fishnet %>%
  left_join(abandoned_fishnet, by = "uniqueID") %>%
  mutate(abandoned_building = replace_na(abandoned_building, 0))

cat("Abandoned building distribution:\n")
Abandoned building distribution:
Code
summary(fishnet$abandoned_building)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    1.00    7.00   26.47   28.00  378.00 
Code
p1 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = abandoned_building), color = NA) +
  scale_fill_viridis_c(name = "Count", option = "magma") +
  labs(title = "Abandoned Buildings 311 Calls") +
  theme_crime()

p2 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = countBurglaries), color = NA) +
  scale_fill_viridis_c(name = "Count", option = "plasma") +
  labs(title = "Burglaries") +
  theme_crime()

p1 + p2 +
  plot_annotation(title = "Are abandoned buildings and burglaries correlated?")

Question 4.1: Do you see a visual relationship between abandoned buildings and burglaries? What does this suggest?

  • The spatial patterns in the two maps do not match perfectly: abandoned-building calls form a few clusters concentrated in the south-central area, while residential burglaries are dispersed more widely across the southern and northern parts of the city.

  • Still, the maps suggest abandoned buildings and burglaries may be correlated, a relationship worth testing formally as we build the model.
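The visual comparison can be backed by a number; with the real data you would run cor(fishnet$abandoned_building, fishnet$countBurglaries) after the join in Exercise 4.2. A self-contained sketch with simulated correlated counts (all values invented):

```r
# Pearson correlation between two count variables where one partly
# drives the other (simulated stand-ins for the fishnet columns).
set.seed(5080)
a <- rpois(1000, lambda = 5)            # e.g. abandoned-building calls
b <- rpois(1000, lambda = 2 + 0.4 * a)  # counts that rise with a

r <- cor(a, b)  # moderate positive correlation by construction
```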

Exercise 4.3: Nearest Neighbor Features

Count in a cell is one measure. Distance to the nearest 3 abandoned buildings captures local context.

Code
# Calculate mean distance to 3 nearest abandoned buildings
# (do this outside of mutate to avoid sf conflicts)

# Get coordinates
fishnet_coords <- st_coordinates(st_centroid(fishnet))
abandoned_coords <- st_coordinates(abandoned_building)

# Calculate k nearest neighbors and distances
nn_result <- get.knnx(abandoned_coords, fishnet_coords, k = 3)

# Add to fishnet
fishnet <- fishnet %>%
  mutate(
    abandoned_building.nn = rowMeans(nn_result$nn.dist)
  )

cat("✓ Calculated nearest neighbor distances\n")
✓ Calculated nearest neighbor distances
Code
summary(fishnet$abandoned_building.nn)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
   0.8906   78.5322  154.3599  261.9420  313.4905 1997.1337 

Question 4.2: What does a low value of abandoned_building.nn mean? A high value? Why might this be informative?

  • abandoned_building.nn is the mean distance from a cell's centroid to its 3 nearest abandoned-building calls. A low value means the cell sits close to reported abandoned buildings; a high value means it is far from any of them. This captures each cell's local exposure to disorder, even for cells whose own count is zero, which makes it an informative predictor.
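The nearest-neighbor feature itself is simple: get.knnx computes it at scale, but the idea fits in a few lines of base R (toy coordinates, invented for illustration):

```r
# Mean distance from one query point to its k = 3 nearest reference
# points -- the same quantity as rowMeans(nn_result$nn.dist) above,
# computed by hand on toy data.
ref <- matrix(c(0, 0,
                1, 0,
                0, 1,
                5, 5), ncol = 2, byrow = TRUE)
query <- c(0, 0)

d <- sqrt((ref[, 1] - query[1])^2 + (ref[, 2] - query[2])^2)
nn_mean <- mean(sort(d)[1:3])  # mean of the 3 smallest distances
```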

Exercise 4.4: Distance to Hot Spots

Let’s identify clusters of abandoned buildings using Local Moran’s I, then calculate distance to these hot spots.

Code
# Function to calculate Local Moran's I
calculate_local_morans <- function(data, variable, k = 5) {
  
  # Create spatial weights
  coords <- st_coordinates(st_centroid(data))
  neighbors <- knn2nb(knearneigh(coords, k = k))
  weights <- nb2listw(neighbors, style = "W", zero.policy = TRUE)
  
  # Calculate Local Moran's I
  local_moran <- localmoran(data[[variable]], weights)
  
  # Classify clusters
  mean_val <- mean(data[[variable]], na.rm = TRUE)
  
  data %>%
    mutate(
      local_i = local_moran[, 1],
      p_value = local_moran[, 5],
      is_significant = p_value < 0.05,
      
      moran_class = case_when(
        !is_significant ~ "Not Significant",
        local_i > 0 & .data[[variable]] > mean_val ~ "High-High",
        local_i > 0 & .data[[variable]] <= mean_val ~ "Low-Low",
        local_i < 0 & .data[[variable]] > mean_val ~ "High-Low",
        local_i < 0 & .data[[variable]] <= mean_val ~ "Low-High",
        TRUE ~ "Not Significant"
      )
    )
}

# Apply to abandoned building
fishnet <- calculate_local_morans(fishnet, "abandoned_building", k = 5)
Code
# Visualize hot spots
ggplot() +
  geom_sf(
    data = fishnet, 
    aes(fill = moran_class), 
    color = NA
  ) +
  scale_fill_manual(
    values = c(
      "High-High" = "#d7191c",
      "High-Low" = "#fdae61",
      "Low-High" = "#abd9e9",
      "Low-Low" = "#2c7bb6",
      "Not Significant" = "gray90"
    ),
    name = "Cluster Type"
  ) +
  labs(
    title = "Local Moran's I: Abandoned Building Clusters",
    subtitle = "High-High = Hot spots of disorder"
  ) +
  theme_crime()

Code
# Get centroids of "High-High" cells (hot spots)
hotspots <- fishnet %>%
  filter(moran_class == "High-High") %>%
  st_centroid()

# Calculate distance from each cell to nearest hot spot
if (nrow(hotspots) > 0) {
  fishnet <- fishnet %>%
    mutate(
      dist_to_hotspot = as.numeric(
        st_distance(st_centroid(fishnet), hotspots %>% st_union())
      )
    )
  
  cat("✓ Calculated distance to abandoned building hot spots\n")
  cat("  - Number of hot spot cells:", nrow(hotspots), "\n")
} else {
  fishnet <- fishnet %>%
    mutate(dist_to_hotspot = 0)
  cat("⚠ No significant hot spots found\n")
}
✓ Calculated distance to abandoned building hot spots
  - Number of hot spot cells: 294 

Question 4.3: Why might distance to a cluster of abandoned buildings be more informative than distance to a single abandoned building? What does Local Moran’s I tell us?

  • It captures spatial dependence at the regional scale (the count and nearest-neighbor features already capture the local and neighborhood scales). Measuring distance to a cluster rather than to a single point damps the influence of isolated outliers and captures spillover effects, which can improve prediction accuracy.

  • Local Moran's I is a local measure of spatial association: it compares each location's value, and its neighbors' values, to the overall mean.

    • High-High: the location and its neighbors are both above the mean, which we interpret as a hot spot.

    • Low-Low: the location and its neighbors are both below the mean, which we interpret as a cold spot.

    • High-Low: the location is above the mean but its neighbors are below it, a spatial outlier.

    • Low-High: the location is below the mean but its neighbors are above it, also a spatial outlier.

Note

Local Moran’s I identifies:

  • High-High: Hot spots (high values surrounded by high values)
  • Low-Low: Cold spots (low values surrounded by low values)
  • High-Low / Low-High: Spatial outliers

This helps us understand spatial clustering patterns.
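A minimal base-R sketch of the statistic behind this (the scaling differs slightly from spdep::localmoran, but the sign logic is the same): with row-standardized weights, I_i = z_i * sum_j w_ij z_j, so I_i is positive where a unit and its neighbors sit on the same side of the mean. The values and chain neighborhood below are invented.

```r
# Local Moran's I by hand on a 6-unit chain (toy data).
x <- c(10, 9, 8, 2, 1, 2)           # a high cluster, then a low cluster

W <- matrix(0, 6, 6)
for (i in 1:6) for (j in 1:6) if (abs(i - j) == 1) W[i, j] <- 1
W <- W / rowSums(W)                 # row-standardized weights

z <- (x - mean(x)) / sd(x)          # standardized values
local_i <- z * as.vector(W %*% z)   # I_i = z_i * sum_j w_ij z_j

# Units inside the high cluster and the low cluster both get positive
# I_i (High-High and Low-Low respectively).
```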


Part 5: Join Police Districts for Cross-Validation

We’ll use police districts for our spatial cross-validation.

Code
# Join district information to fishnet
fishnet <- st_join(
  fishnet,
  policeDistricts,
  join = st_within,
  left = TRUE
) %>%
  filter(!is.na(District))  # Remove cells outside districts

cat("✓ Joined police districts\n")
✓ Joined police districts
Code
cat("  - Districts:", length(unique(fishnet$District)), "\n")
  - Districts: 22 
Code
cat("  - Cells:", nrow(fishnet), "\n")
  - Cells: 1708 

Part 6: Model Fitting

Exercise 6.1: Poisson Regression

Burglary counts are count data (0, 1, 2, 3…). We’ll use Poisson regression.

Code
# Create clean modeling dataset
fishnet_model <- fishnet %>%
  st_drop_geometry() %>%
  dplyr::select(
    uniqueID,
    District,
    countBurglaries,
    abandoned_building,
    abandoned_building.nn,
    dist_to_hotspot
  ) %>%
  na.omit()  # Remove any remaining NAs

cat("✓ Prepared modeling data\n")
✓ Prepared modeling data
Code
cat("  - Observations:", nrow(fishnet_model), "\n")
  - Observations: 1708 
Code
cat("  - Variables:", ncol(fishnet_model), "\n")
  - Variables: 6 
Code
# Fit Poisson regression
model_poisson <- glm(
  countBurglaries ~ abandoned_building + abandoned_building.nn + 
    dist_to_hotspot,
  data = fishnet_model,
  family = "poisson"
)

# Summary
summary(model_poisson)

Call:
glm(formula = countBurglaries ~ abandoned_building + abandoned_building.nn + 
    dist_to_hotspot, family = "poisson", data = fishnet_model)

Coefficients:
                          Estimate   Std. Error z value             Pr(>|z|)
(Intercept)            1.889753853  0.032485490  58.172 < 0.0000000000000002
abandoned_building     0.001998030  0.000239185   8.354 < 0.0000000000000002
abandoned_building.nn -0.004631985  0.000178454 -25.956 < 0.0000000000000002
dist_to_hotspot       -0.000030309  0.000005665  -5.351         0.0000000876
                         
(Intercept)           ***
abandoned_building    ***
abandoned_building.nn ***
dist_to_hotspot       ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 6710.3  on 1707  degrees of freedom
Residual deviance: 4256.4  on 1704  degrees of freedom
AIC: 8324.7

Number of Fisher Scoring iterations: 6

Question 6.1: Interpret the coefficients. Which variables are significant? What do the signs (positive/negative) tell you?

  • All three predictors are significant: abandoned_building, abandoned_building.nn, and dist_to_hotspot.

  • The positive coefficient on abandoned_building indicates that cells with more abandoned-building calls tend to have higher burglary counts.

  • The negative coefficient on abandoned_building.nn indicates that cells closer to their nearest abandoned buildings (a smaller mean distance) tend to have higher burglary counts.

  • The negative coefficient on dist_to_hotspot indicates that cells closer to disorder hot spots tend to have higher burglary counts.
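Because the model uses a log link, the coefficients are easier to read as incidence rate ratios (IRR): exp(beta) is the multiplicative change in the expected count per one-unit change in the predictor. A sketch using the rounded Poisson point estimates from the summary above:

```r
# Incidence rate ratios from the Poisson coefficients above (rounded
# point estimates copied from the summary output).
coefs <- c(abandoned_building    =  0.001998,
           abandoned_building.nn = -0.004632,
           dist_to_hotspot       = -0.0000303)

irr <- exp(coefs)
# abandoned_building: each additional call multiplies the expected
# burglary count by about 1.002 (a ~0.2% increase); the two distance
# terms have IRRs just below 1, i.e. expected counts fall as the
# distances grow.
```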

Exercise 6.2: Check for Overdispersion

Poisson regression assumes mean = variance. Real count data often violates this (overdispersion).

Code
# Calculate dispersion parameter
dispersion <- sum(residuals(model_poisson, type = "pearson")^2) / 
              model_poisson$df.residual

cat("Dispersion parameter:", round(dispersion, 2), "\n")
Dispersion parameter: 2.76 
Code
cat("Rule of thumb: >1.5 suggests overdispersion\n")
Rule of thumb: >1.5 suggests overdispersion
Code
if (dispersion > 1.5) {
  cat("⚠ Overdispersion detected! Consider Negative Binomial model.\n")
} else {
  cat("✓ Dispersion looks okay for Poisson model.\n")
}
⚠ Overdispersion detected! Consider Negative Binomial model.

Exercise 6.3: Negative Binomial Regression

If overdispersed, use Negative Binomial regression (more flexible).

Code
# Fit Negative Binomial model
model_nb <- glm.nb(
  countBurglaries ~ abandoned_building + abandoned_building.nn + 
    dist_to_hotspot,
  data = fishnet_model
)

# Summary
summary(model_nb)

Call:
glm.nb(formula = countBurglaries ~ abandoned_building + abandoned_building.nn + 
    dist_to_hotspot, data = fishnet_model, init.theta = 2.148065465, 
    link = log)

Coefficients:
                          Estimate   Std. Error z value             Pr(>|z|)
(Intercept)            1.935501094  0.056376456   34.33 < 0.0000000000000002
abandoned_building     0.001999655  0.000487666    4.10            0.0000412
abandoned_building.nn -0.005266039  0.000256987  -20.49 < 0.0000000000000002
dist_to_hotspot       -0.000018583  0.000008768   -2.12                0.034
                         
(Intercept)           ***
abandoned_building    ***
abandoned_building.nn ***
dist_to_hotspot       *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(2.1481) family taken to be 1)

    Null deviance: 2936.4  on 1707  degrees of freedom
Residual deviance: 1812.3  on 1704  degrees of freedom
AIC: 7271.4

Number of Fisher Scoring iterations: 1

              Theta:  2.148 
          Std. Err.:  0.135 

 2 x log-likelihood:  -7261.384 
Code
# Compare AIC (lower is better)
cat("\nModel Comparison:\n")

Model Comparison:
Code
cat("Poisson AIC:", round(AIC(model_poisson), 1), "\n")
Poisson AIC: 8324.7 
Code
cat("Negative Binomial AIC:", round(AIC(model_nb), 1), "\n")
Negative Binomial AIC: 7271.4 

Question 6.2: Which model fits better (lower AIC)? What does this tell you about the data?

  • The Negative Binomial model has the lower AIC (7,271.4 vs. 8,324.7), so it fits the data better.

  • This confirms the counts are overdispersed: the variance of burglary counts is much greater than their mean, violating the Poisson assumption, and the Negative Binomial's extra dispersion parameter absorbs that excess variance.
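The AIC gap can be reproduced in miniature: fit both models to simulated overdispersed counts and the Negative Binomial wins. Everything below is invented data, not the assignment's; it only assumes MASS::glm.nb, as already used above.

```r
# Poisson vs. Negative Binomial on simulated overdispersed counts.
library(MASS)

set.seed(5080)
n <- 1500
x <- runif(n)
y <- rnbinom(n, size = 1.5, mu = exp(1 + 0.5 * x))  # overdispersed

m_pois <- glm(y ~ x, family = poisson)
m_nb   <- glm.nb(y ~ x)

# With overdispersion the NB model should have the lower AIC:
aic_gap <- AIC(m_pois) - AIC(m_nb)
```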

Part 7: Spatial Cross-Validation

Standard cross-validation randomly splits data. But with spatial data, this means training on cells right next to test cells (information leakage!).

Leave-One-Group-Out (LOGO) Cross-Validation trains on all districts except one, then tests on the held-out district.

Code
# Get unique districts
districts <- unique(fishnet_model$District)
cv_results <- tibble()

cat("Running LOGO Cross-Validation...\n")
Running LOGO Cross-Validation...
Code
for (i in seq_along(districts)) {
  
  test_district <- districts[i]
  
  # Split data
  train_data <- fishnet_model %>% filter(District != test_district)
  test_data <- fishnet_model %>% filter(District == test_district)
  
  # Fit model on training data
  model_cv <- glm.nb(
    countBurglaries ~ abandoned_building + abandoned_building.nn + 
      dist_to_hotspot,
    data = train_data
  )
  
  # Predict on test data
  test_data <- test_data %>%
    mutate(
      prediction = predict(model_cv, test_data, type = "response")
    )
  
  # Calculate metrics
  mae <- mean(abs(test_data$countBurglaries - test_data$prediction))
  rmse <- sqrt(mean((test_data$countBurglaries - test_data$prediction)^2))
  
  # Store results
  cv_results <- bind_rows(
    cv_results,
    tibble(
      fold = i,
      test_district = test_district,
      n_test = nrow(test_data),
      mae = mae,
      rmse = rmse
    )
  )
  
  cat("  Fold", i, "/", length(districts), "- District", test_district, 
      "- MAE:", round(mae, 2), "\n")
}
  Fold 1 / 22 - District 5 - MAE: 2.09 
  Fold 2 / 22 - District 4 - MAE: 1.73 
  Fold 3 / 22 - District 22 - MAE: 2.31 
  Fold 4 / 22 - District 6 - MAE: 2.83 
  Fold 5 / 22 - District 8 - MAE: 2.21 
  Fold 6 / 22 - District 7 - MAE: 4.01 
  Fold 7 / 22 - District 3 - MAE: 5.16 
  Fold 8 / 22 - District 2 - MAE: 2.26 
  Fold 9 / 22 - District 9 - MAE: 2.24 
  Fold 10 / 22 - District 10 - MAE: 2.06 
  Fold 11 / 22 - District 1 - MAE: 1.87 
  Fold 12 / 22 - District 12 - MAE: 3.27 
  Fold 13 / 22 - District 15 - MAE: 2.07 
  Fold 14 / 22 - District 11 - MAE: 3.06 
  Fold 15 / 22 - District 18 - MAE: 2.31 
  Fold 16 / 22 - District 25 - MAE: 2.64 
  Fold 17 / 22 - District 14 - MAE: 2.95 
  Fold 18 / 22 - District 19 - MAE: 1.92 
  Fold 19 / 22 - District 16 - MAE: 1.82 
  Fold 20 / 22 - District 17 - MAE: 1.6 
  Fold 21 / 22 - District 20 - MAE: 1.43 
  Fold 22 / 22 - District 24 - MAE: 2.03 
Code
# Overall results
cat("\n✓ Cross-Validation Complete\n")

✓ Cross-Validation Complete
Code
cat("Mean MAE:", round(mean(cv_results$mae), 2), "\n")
Mean MAE: 2.45 
Code
cat("Mean RMSE:", round(mean(cv_results$rmse), 2), "\n")
Mean RMSE: 3.36 
Code
# Show results
cv_results %>%
  arrange(desc(mae)) %>%
  kable(
    digits = 2,
    caption = "LOGO CV Results by District"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
LOGO CV Results by District
fold test_district n_test mae rmse
7 3 43 5.16 7.05
6 7 52 4.01 4.68
12 12 73 3.27 4.81
14 11 43 3.06 3.52
17 14 46 2.95 4.19
4 6 63 2.83 4.02
16 25 85 2.64 3.68
3 22 112 2.31 2.77
15 18 30 2.31 4.27
8 2 56 2.26 3.15
9 9 107 2.24 2.78
5 8 197 2.21 3.21
1 5 98 2.09 2.74
13 15 32 2.07 2.48
10 10 63 2.06 2.79
22 24 41 2.03 3.03
18 19 63 1.92 2.50
11 1 28 1.87 2.74
19 16 129 1.82 2.26
2 4 235 1.73 3.28
20 17 82 1.60 2.11
21 20 30 1.43 1.88

Question 7.1: Why is spatial CV more appropriate than random CV for this problem? Which districts were hardest to predict?

  • Problems with random CV for spatial data:

    • Observations that are geographically close tend to be spatially correlated

    • A random split can put training cells directly adjacent to test cells

    • This creates spatial leakage: the model indirectly learns from the locations it is supposed to predict, so its performance metrics look overly optimistic

  • Spatial (LOGO) CV removes this leakage by holding out entire districts. The hardest districts to predict were District 3 (MAE 5.16) and District 7 (MAE 4.01).
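The leakage argument can be demonstrated with a toy example (all data invented): when observations share a strong group-level signal, a random split lets same-district rows leak into training, so random CV reports a much lower MAE than a LOGO-style split that holds whole districts out.

```r
# Random vs. LOGO-style CV on grouped (district-like) data.
set.seed(5080)
g <- rep(1:10, each = 50)            # 10 toy "districts"
mu_g <- rnorm(10, sd = 2)[g]         # strong district-level signal
y <- mu_g + rnorm(500)               # outcome = district effect + noise

# "Model": predict each test row with its district's training mean.
pred_mae <- function(train_idx) {
  test_idx <- setdiff(seq_along(y), train_idx)
  means <- tapply(y[train_idx], g[train_idx], mean)
  p <- means[as.character(g[test_idx])]
  p[is.na(p)] <- mean(y[train_idx])  # unseen district: fall back to grand mean
  mean(abs(y[test_idx] - p))
}

random_mae <- pred_mae(sample(500, 400))                            # leaky split
logo_mae <- mean(sapply(1:10, function(d) pred_mae(which(g != d)))) # no leakage
```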

Note: Connection to Week 5

Remember learning about train/test splits and cross-validation? This is a spatial version of that concept!

Why it matters: If we can only predict well in areas we’ve already heavily policed, what does that tell us about the model’s utility?

Part 8: Model Predictions and Comparison

Exercise 8.1: Generate Final Predictions

Code
# Fit final model on all data
final_model <- glm.nb(
  countBurglaries ~ abandoned_building + abandoned_building.nn + 
    dist_to_hotspot,
  data = fishnet_model
)

# Add predictions back to fishnet
fishnet <- fishnet %>%
  mutate(
    prediction_nb = predict(final_model, fishnet_model, type = "response")[match(uniqueID, fishnet_model$uniqueID)]
  )

# Also add KDE predictions (normalize to same scale as counts)
kde_sum <- sum(fishnet$kde_value, na.rm = TRUE)
count_sum <- sum(fishnet$countBurglaries, na.rm = TRUE)
fishnet <- fishnet %>%
  mutate(
    prediction_kde = (kde_value / kde_sum) * count_sum
  )

Exercise 8.2: Compare Model vs. KDE Baseline

Code
# Create three maps
p1 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = countBurglaries), color = NA) +
  scale_fill_viridis_c(name = "Count", option = "plasma", limits = c(0, 15)) +
  labs(title = "Actual Burglaries") +
  theme_crime()

p2 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = prediction_nb), color = NA) +
  scale_fill_viridis_c(name = "Predicted", option = "plasma", limits = c(0, 15)) +
  labs(title = "Model Predictions (Neg. Binomial)") +
  theme_crime()

p3 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = prediction_kde), color = NA) +
  scale_fill_viridis_c(name = "Predicted", option = "plasma", limits = c(0, 15)) +
  labs(title = "KDE Baseline Predictions") +
  theme_crime()

p1 + p2 + p3 +
  plot_annotation(
    title = "Actual vs. Predicted Burglaries",
    subtitle = "Does our complex model outperform simple KDE?"
  )

Code
# Calculate performance metrics
comparison <- fishnet %>%
  st_drop_geometry() %>%
  filter(!is.na(prediction_nb), !is.na(prediction_kde)) %>%
  summarize(
    model_mae = mean(abs(countBurglaries - prediction_nb)),
    model_rmse = sqrt(mean((countBurglaries - prediction_nb)^2)),
    kde_mae = mean(abs(countBurglaries - prediction_kde)),
    kde_rmse = sqrt(mean((countBurglaries - prediction_kde)^2))
  )

comparison %>%
  pivot_longer(everything(), names_to = "metric", values_to = "value") %>%
  separate(metric, into = c("approach", "metric"), sep = "_") %>%
  pivot_wider(names_from = metric, values_from = value) %>%
  kable(
    digits = 2,
    caption = "Model Performance Comparison"
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
Model Performance Comparison
approach mae rmse
model 2.24 3.28
kde 2.06 2.95

Question 8.1: Does the complex model outperform the simple KDE baseline? By how much? Is the added complexity worth it?

  • Sadly, the complex model performs worse than the KDE baseline: RMSE is 3.28 for the model versus 2.95 for KDE, and MAE is 2.24 versus 2.06.

  • The added complexity does not yield better predictive accuracy here. Next steps might include trying different predictors, adding more predictors, or engineering new spatial features.
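One way to probe whether extra predictors would help is to refit with an additional feature and compare AIC. A minimal sketch, where `burglaries.nn` is a hypothetical k-nearest-neighbor feature not built in this document:

```r
# Compare the final model against an augmented candidate via AIC
# (sketch; `burglaries.nn` is a hypothetical feature, not created above)
candidate <- update(final_model, . ~ . + burglaries.nn)
AIC(final_model, candidate)  # lower AIC suggests a better trade-off
```

AIC only measures in-sample fit penalized for complexity; any promising candidate should still be re-evaluated with the LOGO CV loop before trusting it.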

Exercise 9.3: Where Does the Model Work Well?

Code
# Calculate errors
fishnet <- fishnet %>%
  mutate(
    error_nb = countBurglaries - prediction_nb,
    error_kde = countBurglaries - prediction_kde,
    abs_error_nb = abs(error_nb),
    abs_error_kde = abs(error_kde)
  )

# Map errors
p1 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = error_nb), color = NA) +
  scale_fill_gradient2(
    name = "Error",
    low = "#2166ac", mid = "white", high = "#b2182b",
    midpoint = 0,
    limits = c(-10, 10)
  ) +
  labs(title = "Model Errors (Actual - Predicted)") +
  theme_crime()

p2 <- ggplot() +
  geom_sf(data = fishnet, aes(fill = abs_error_nb), color = NA) +
  scale_fill_viridis_c(name = "Abs. Error", option = "magma") +
  labs(title = "Absolute Model Errors") +
  theme_crime()

p1 + p2

Question 9.2: Where does the model make the biggest errors? Are there spatial patterns in the errors? What might this reveal?

  • The model errors appear spatially random. A few cells record high absolute errors, but they are dispersed rather than clustered.

  • This suggests the residuals are not strongly spatially autocorrelated, i.e., the spatial features have absorbed most of the spatial structure in the data.
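The "errors look random" claim can be checked formally with a global Moran's I test on the residuals. A sketch, assuming a queen-contiguity weights list like the one built for the earlier Local Moran's I diagnostics:

```r
# Global Moran's I on model residuals (sketch)
# A non-significant p-value supports the visual impression that
# residuals are not spatially autocorrelated.
nb <- spdep::poly2nb(fishnet, queen = TRUE)
lw <- spdep::nb2listw(nb, style = "W", zero.policy = TRUE)

spdep::moran.test(
  fishnet$error_nb,
  lw,
  zero.policy = TRUE,
  na.action = na.omit
)
```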

Part 10: Summary Statistics and Tables

Exercise 10.1: Model Summary Table

Code
# Create nice summary table
model_summary <- broom::tidy(final_model, exponentiate = TRUE) %>%
  mutate(
    across(where(is.numeric), ~round(., 3))
  )

model_summary %>%
  kable(
    caption = "Final Negative Binomial Model Coefficients (Exponentiated)",
    col.names = c("Variable", "Rate Ratio", "Std. Error", "Z", "P-Value")
  ) %>%
  kable_styling(bootstrap_options = c("striped", "hover")) %>%
  footnote(
    general = "Rate ratios > 1 indicate positive association with burglary counts."
  )
Final Negative Binomial Model Coefficients (Exponentiated)
Variable Rate Ratio Std. Error Z P-Value
(Intercept) 6.928 0.056 34.332 0.000
abandoned_building 1.002 0.000 4.100 0.000
abandoned_building.nn 0.995 0.000 -20.491 0.000
dist_to_hotspot 1.000 0.000 -2.120 0.034
Note:
Rate ratios > 1 indicate positive association with burglary counts.
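Rate ratios translate directly into percent changes in expected counts. For example, the 0.995 ratio on `abandoned_building.nn` implies each one-unit increase in that distance is associated with roughly a (0.995 − 1) × 100 = −0.5% change in expected burglaries. A one-line sketch to compute this for all coefficients:

```r
# Percent change in expected burglary count per one-unit increase
# in each predictor (sketch; uses the fitted `final_model`)
round((exp(coef(final_model)) - 1) * 100, 2)
```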

Exercise 10.2: Key Findings Summary

Technical Performance:

  • Cross-validation MAE: 2.45 (mean across LOGO folds)
  • Model vs. KDE: KDE performs better (RMSE 2.95 vs. 3.28)
  • Most predictive variable: abandoned_building.nn

Spatial Patterns:

  • Burglaries are clustered
  • Hot spots are located in suburb areas on the northwestern and southern side
  • Model errors show random patterns

Model Limitations:

  • Overdispersion: Yes
  • Spatial autocorrelation in residuals: No; model errors are randomly distributed.
  • Cells with zero counts: 31% of cells have zero burglaries
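The overdispersion and zero-count figures above can be reproduced with two quick checks. A sketch against the fishnet built earlier:

```r
# Quick diagnostics behind the limitations listed above (sketch):
# a variance/mean ratio well above 1 indicates overdispersion,
# justifying the Negative Binomial over Poisson
with(sf::st_drop_geometry(fishnet), {
  cat("Variance/mean ratio:",
      round(var(countBurglaries, na.rm = TRUE) /
              mean(countBurglaries, na.rm = TRUE), 2), "\n")
  cat("Share of zero-count cells:",
      round(mean(countBurglaries == 0, na.rm = TRUE) * 100, 1), "%\n")
})
```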

Conclusion and Next Steps

ImportantWhat You’ve Accomplished

You’ve successfully built a spatial predictive model for burglaries using:

✓ Fishnet aggregation
✓ Spatial features (counts, distances, nearest neighbors)
✓ Spatial autocorrelation diagnostics (Local Moran’s I)
✓ Count regression (Poisson and Negative Binomial)
✓ Spatial cross-validation (LOGO)
✓ Comparison to baseline (KDE)
